Explained Variance Score (explained_variance_score)#

explained_variance_score is a regression metric that asks a very specific question:

How much of the variability (spread) in \(y\) do the predictions \(\hat{y}\) explain?

It is similar to \(R^2\), but with an important twist: it is computed from the variance of the residuals, which makes it insensitive to constant (additive) bias.


Learning goals#

  • derive and interpret the metric (including edge cases)

  • understand its relationship to MSE and \(R^2\)

  • implement it from scratch in NumPy (sample weights + multi-output)

  • visualize how it reacts to noise vs bias

  • see what happens if you try to optimize it directly (gradient descent)

Notation#

  • True targets: \(y \in \mathbb{R}^n\)

  • Predictions: \(\hat{y} \in \mathbb{R}^n\)

  • Residuals: \(r = y - \hat{y}\)

  • Mean: \(\bar{z} = \frac{1}{n}\sum_{i=1}^n z_i\)

  • Variance (population / ddof=0): \(\mathrm{Var}(z) = \frac{1}{n}\sum_{i=1}^n (z_i - \bar{z})^2\)

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) Definition and intuition#

For a single regression target, explained variance is defined as

\[ \mathrm{EVS}(y, \hat{y}) = 1 - \frac{\mathrm{Var}(y - \hat{y})}{\mathrm{Var}(y)}. \]
  • Numerator: variance of the residuals \(r = y - \hat{y}\) (after removing the residual mean).

  • Denominator: variance of the target \(y\).

Interpretation:

  • 1.0: residuals have zero variance (they are all equal) → the model captures the shape of \(y\) perfectly.

  • 0.0: residual variance equals target variance → predictions do not reduce the spread.

  • < 0: residual variance is larger than target variance → worse than a naive baseline.

Key property: shift invariance (bias blindness)#

For any constant \(c\),

\[ \mathrm{EVS}(y, \hat{y}+c) = \mathrm{EVS}(y, \hat{y}), \]

because adding a constant shifts residuals by \(-c\) but does not change their variance.
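A quick numeric check of this property on toy data (the arrays and names here are illustrative, not from the rest of the notebook):

```python
import numpy as np
from sklearn.metrics import explained_variance_score

rng_demo = np.random.default_rng(0)
y = rng_demo.normal(size=100)
y_hat = y + rng_demo.normal(scale=0.3, size=100)

base = explained_variance_score(y, y_hat)
shifted = explained_variance_score(y, y_hat + 5.0)  # add a large constant offset

print(np.isclose(base, shifted))  # → True: the score is unchanged
```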

Edge case: constant targets and force_finite#

If \(\mathrm{Var}(y)=0\) (all targets identical), the fraction above is ill-defined. scikit-learn uses force_finite=True by default and returns:

  • 1.0 if \(\mathrm{Var}(y-\hat{y})=0\) (e.g., predictions differ from \(y\) by a constant offset)

  • 0.0 otherwise
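A minimal sketch of that default behavior (toy constant targets; the later parity checks in section 3 exercise the same cases):

```python
import numpy as np
from sklearn.metrics import explained_variance_score

y_const = np.full(5, 3.0)  # Var(y) = 0: the fraction is ill-defined

# Predictions off by a constant: Var(residual) = 0 as well -> score 1.0
print(explained_variance_score(y_const, y_const + 1.0))   # 1.0

# Predictions that vary: Var(residual) > 0 -> score 0.0
print(explained_variance_score(y_const, np.arange(5.0)))  # 0.0
```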

2) Relationship to MSE and \(R^2\)#

Let residuals be \(r = y - \hat{y}\). The mean squared error decomposes as:

\[ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^n r_i^2 = \underbrace{\mathrm{Var}(r)}_{\text{spread}} + \underbrace{\bar{r}^2}_{\text{(additive) bias}^2}. \]

This yields:

\[ \mathrm{EVS} = 1 - \frac{\mathrm{Var}(r)}{\mathrm{Var}(y)}, \qquad R^2 = 1 - \frac{\mathrm{MSE}}{\mathrm{Var}(y)} = \mathrm{EVS} - \frac{\bar{r}^2}{\mathrm{Var}(y)}. \]

So \(R^2 \le \mathrm{EVS}\), with equality when the mean residual is zero (no additive bias).

Practical takeaway: explained variance measures how well you match the fluctuations of \(y\); it can look great even when you are systematically too high/low.
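The decomposition and the \(R^2\)–EVS identity can be verified numerically (toy data; scaled-and-shifted predictions give a nonzero mean residual):

```python
import numpy as np
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score

rng_demo = np.random.default_rng(1)
y = rng_demo.normal(size=200)
y_hat = 0.8 * y + 0.5  # scale + shift -> nonzero mean residual

r = y - y_hat
mse = mean_squared_error(y, y_hat)
evs = explained_variance_score(y, y_hat)

# MSE = Var(r) + mean(r)^2
assert np.isclose(mse, r.var() + r.mean() ** 2)

# R^2 = EVS - mean(r)^2 / Var(y)  (population variances, ddof=0)
assert np.isclose(r2_score(y, y_hat), evs - r.mean() ** 2 / y.var())
```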

# Synthetic signal + noise
n = 400
x = np.linspace(0, 2 * np.pi, n)
y_signal = np.sin(x)
y_true = y_signal + rng.normal(0.0, 0.25, size=n)

# A few prediction candidates
preds = {
    'signal (denoised)': y_signal,
    'signal + constant bias': y_signal + 1.0,
    'mean baseline': np.full_like(y_true, y_true.mean()),
    'signal + extra noise': y_signal + rng.normal(0.0, 0.9, size=n),
    'wrong phase': np.sin(x + 0.7),
}


def summarize(y_true, y_pred):
    residuals = y_true - y_pred
    return {
        'EVS': explained_variance_score(y_true, y_pred),
        'R2': r2_score(y_true, y_pred),
        'MSE': mean_squared_error(y_true, y_pred),
        'residual_mean': float(residuals.mean()),
        'residual_var': float(residuals.var()),
    }


summaries = {name: summarize(y_true, y_pred) for name, y_pred in preds.items()}
summaries
{'signal (denoised)': {'EVS': 0.8944126033656888,
  'R2': 0.8944094638062047,
  'MSE': 0.05654500849911091,
  'residual_mean': -0.0012966387524886791,
  'residual_var': 0.05654332722705647},
 'signal + constant bias': {'EVS': 0.8944126033656888,
  'R2': -0.9778046284015942,
  'MSE': 1.0591382860040883,
  'residual_mean': -1.0012966387524886,
  'residual_var': 0.05654332722705647},
 'mean baseline': {'EVS': 0.0,
  'R2': 0.0,
  'MSE': 0.53551208789518,
  'residual_mean': -3.552713678800501e-17,
  'residual_var': 0.53551208789518},
 'signal + extra noise': {'EVS': -0.6971716305768838,
  'R2': -0.7011039492575686,
  'MSE': 0.9109617275936569,
  'residual_mean': 0.04588904212295869,
  'residual_var': 0.9088559234066942},
 'wrong phase': {'EVS': 0.473680853439166,
  'R2': 0.47366507095168253,
  'MSE': 0.2818587167868259,
  'residual_mean': -0.0029071829705829798,
  'residual_var': 0.28185026507400146}}
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=x,
        y=y_true,
        mode='markers',
        name='y_true',
        marker=dict(size=4, opacity=0.5),
    )
)

for name, y_pred in preds.items():
    evs = summaries[name]['EVS']
    r2 = summaries[name]['R2']
    fig.add_trace(
        go.Scatter(
            x=x,
            y=y_pred,
            mode='lines',
            name=f'{name} (EVS={evs:.3f}, R2={r2:.3f})',
        )
    )

fig.update_layout(
    title='Same residual variance ⇒ same EVS (bias does not matter)',
    xaxis_title='x',
    yaxis_title='y',
    legend_title='Predictor',
)
fig
fig = go.Figure()

bins = dict(start=-3, end=3, size=0.1)
for name, y_pred in preds.items():
    residuals = y_true - y_pred
    fig.add_trace(
        go.Histogram(
            x=residuals,
            name=f"{name} (mean={residuals.mean():.2f}, std={residuals.std():.2f})",
            opacity=0.55,
            xbins=bins,
        )
    )

fig.update_layout(
    title='Residual distributions (EVS uses variance, not the mean)',
    xaxis_title='residual (y_true - y_pred)',
    yaxis_title='count',
    barmode='overlay',
)
fig
biases = np.linspace(-2.5, 2.5, 101)
evs_vals = []
r2_vals = []

base = preds['signal (denoised)']
for c in biases:
    y_pred = base + c
    evs_vals.append(explained_variance_score(y_true, y_pred))
    r2_vals.append(r2_score(y_true, y_pred))

fig = go.Figure()
fig.add_trace(go.Scatter(x=biases, y=evs_vals, mode='lines', name='EVS'))
fig.add_trace(go.Scatter(x=biases, y=r2_vals, mode='lines', name='R2'))
fig.update_layout(
    title='Adding a constant offset changes R2 but not explained variance',
    xaxis_title='constant offset c added to predictions',
    yaxis_title='score',
)
fig
sigmas = np.linspace(0.0, 2.0, 31)
var_y = float(np.var(y_true))
evs_theory = 1.0 - (sigmas**2) / var_y

n_trials = 300
rng_sim = np.random.default_rng(123)
evs_sim = []
for sigma in sigmas:
    scores = []
    for _ in range(n_trials):
        y_pred = y_true + rng_sim.normal(0.0, sigma, size=n)
        scores.append(explained_variance_score(y_true, y_pred))
    evs_sim.append(float(np.mean(scores)))

fig = go.Figure()
fig.add_trace(go.Scatter(x=sigmas, y=evs_theory, mode='lines', name='theory: 1 - σ²/Var(y)'))
fig.add_trace(
    go.Scatter(
        x=sigmas,
        y=evs_sim,
        mode='markers+lines',
        name=f'simulation mean ({n_trials} trials)',
    )
)
fig.update_layout(
    title='Independent noise reduces EVS proportionally to its variance',
    xaxis_title='noise σ (y_pred = y_true + ε, ε ~ N(0, σ²))',
    yaxis_title='explained variance score',
)
fig

3) NumPy implementation (from scratch)#

scikit-learn supports:

  • optional sample_weight (per-sample weights)

  • multi-output regression (shape (n_samples, n_outputs))

  • output aggregation via multioutput:

    • raw_values: one score per output

    • uniform_average: mean across outputs

    • variance_weighted: mean weighted by each output’s target variance

Below is a compact NumPy implementation that matches sklearn.metrics.explained_variance_score (including force_finite). One small convenience difference: for single-output input with multioutput='raw_values' it returns a plain float rather than a length-1 array.

def _to_2d(a):
    a = np.asarray(a)
    if a.ndim == 1:
        return a.reshape(-1, 1)
    if a.ndim == 2:
        return a
    raise ValueError(f'Expected 1D or 2D array, got shape {a.shape}')


def _check_sample_weight(sample_weight, n_samples):
    if sample_weight is None:
        return None
    w = np.asarray(sample_weight, dtype=float)
    if w.ndim != 1:
        raise ValueError(f'sample_weight must be 1D, got shape {w.shape}')
    if w.shape[0] != n_samples:
        raise ValueError(f'sample_weight length {w.shape[0]} != n_samples {n_samples}')
    return w


def _weighted_mean(a, sample_weight):
    return np.average(a, weights=sample_weight, axis=0)


def _weighted_variance(a, sample_weight):
    mean = _weighted_mean(a, sample_weight)
    return np.average((a - mean) ** 2, weights=sample_weight, axis=0)


def explained_variance_score_np(
    y_true,
    y_pred,
    *,
    sample_weight=None,
    multioutput='uniform_average',
    force_finite=True,
):
    y_true = _to_2d(y_true)
    y_pred = _to_2d(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError(f'y_true shape {y_true.shape} != y_pred shape {y_pred.shape}')

    n_samples, n_outputs = y_true.shape
    w = _check_sample_weight(sample_weight, n_samples)

    residual = y_true - y_pred
    numerator = _weighted_variance(residual, w)
    denominator = _weighted_variance(y_true, w)

    with np.errstate(divide='ignore', invalid='ignore'):
        output_scores = 1.0 - (numerator / denominator)

    if force_finite:
        zero_den = denominator == 0
        output_scores = output_scores.copy()
        output_scores[zero_den & (numerator == 0)] = 1.0
        output_scores[zero_den & (numerator != 0)] = 0.0

    if multioutput == 'raw_values':
        return output_scores if n_outputs > 1 else float(output_scores[0])

    if multioutput == 'uniform_average':
        return float(np.mean(output_scores))

    if multioutput == 'variance_weighted':
        if np.all(denominator == 0):
            return float(np.mean(output_scores))
        return float(np.average(output_scores, weights=denominator))

    # array-like weights
    mo = np.asarray(multioutput, dtype=float)
    if mo.shape != (n_outputs,):
        raise ValueError(f'multioutput weights must have shape ({n_outputs},), got {mo.shape}')
    return float(np.average(output_scores, weights=mo))


def mean_squared_error_np(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average'):
    y_true = _to_2d(y_true)
    y_pred = _to_2d(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError(f'y_true shape {y_true.shape} != y_pred shape {y_pred.shape}')

    n_samples, n_outputs = y_true.shape
    w = _check_sample_weight(sample_weight, n_samples)
    raw = np.average((y_true - y_pred) ** 2, weights=w, axis=0)

    if multioutput == 'raw_values':
        return raw if n_outputs > 1 else float(raw[0])
    if multioutput == 'uniform_average':
        return float(np.mean(raw))

    mo = np.asarray(multioutput, dtype=float)
    if mo.shape != (n_outputs,):
        raise ValueError(f'multioutput weights must have shape ({n_outputs},), got {mo.shape}')
    return float(np.average(raw, weights=mo))


def r2_score_np(
    y_true,
    y_pred,
    *,
    sample_weight=None,
    multioutput='uniform_average',
    force_finite=True,
):
    y_true = _to_2d(y_true)
    y_pred = _to_2d(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError(f'y_true shape {y_true.shape} != y_pred shape {y_pred.shape}')

    n_samples, n_outputs = y_true.shape
    w = _check_sample_weight(sample_weight, n_samples)

    numerator = np.average((y_true - y_pred) ** 2, weights=w, axis=0)
    denominator = _weighted_variance(y_true, w)

    with np.errstate(divide='ignore', invalid='ignore'):
        output_scores = 1.0 - (numerator / denominator)

    if force_finite:
        zero_den = denominator == 0
        output_scores = output_scores.copy()
        output_scores[zero_den & (numerator == 0)] = 1.0
        output_scores[zero_den & (numerator != 0)] = 0.0

    if multioutput == 'raw_values':
        return output_scores if n_outputs > 1 else float(output_scores[0])
    if multioutput == 'uniform_average':
        return float(np.mean(output_scores))
    if multioutput == 'variance_weighted':
        if np.all(denominator == 0):
            return float(np.mean(output_scores))
        return float(np.average(output_scores, weights=denominator))

    mo = np.asarray(multioutput, dtype=float)
    if mo.shape != (n_outputs,):
        raise ValueError(f'multioutput weights must have shape ({n_outputs},), got {mo.shape}')
    return float(np.average(output_scores, weights=mo))
# Quick parity checks vs scikit-learn (random data)
for n_outputs in [1, 3]:
    y_true = rng.normal(size=(200, n_outputs))
    y_pred = y_true + rng.normal(scale=0.8, size=(200, n_outputs))
    w = rng.uniform(0.1, 2.0, size=200)

    for mo in ['raw_values', 'uniform_average', 'variance_weighted']:
        sk = explained_variance_score(y_true, y_pred, sample_weight=w, multioutput=mo)
        ours = explained_variance_score_np(y_true, y_pred, sample_weight=w, multioutput=mo)
        np.testing.assert_allclose(ours, sk, rtol=1e-12, atol=1e-12)

# Degenerate denominator behavior (constant y)
y_const = np.ones(10)
y_const_pred_const = np.zeros_like(y_const)
y_const_pred_vary = np.arange(y_const.size)

np.testing.assert_allclose(
    explained_variance_score_np(y_const, y_const_pred_const),
    explained_variance_score(y_const, y_const_pred_const),
)
np.testing.assert_allclose(
    explained_variance_score_np(y_const, y_const_pred_vary),
    explained_variance_score(y_const, y_const_pred_vary),
)

# `force_finite=False` matches NaN/-inf conventions
np.testing.assert_allclose(
    explained_variance_score_np(y_const, y_const_pred_const, force_finite=False),
    explained_variance_score(y_const, y_const_pred_const, force_finite=False),
    equal_nan=True,
)
np.testing.assert_allclose(
    explained_variance_score_np(y_const, y_const_pred_vary, force_finite=False),
    explained_variance_score(y_const, y_const_pred_vary, force_finite=False),
    equal_nan=True,
)

'All checks passed.'
/home/tempa/miniconda3/lib/python3.12/site-packages/sklearn/metrics/_regression.py:930: RuntimeWarning:

invalid value encountered in divide

/home/tempa/miniconda3/lib/python3.12/site-packages/sklearn/metrics/_regression.py:930: RuntimeWarning:

divide by zero encountered in divide
'All checks passed.'

4) Multi-output aggregation (multioutput)#

For multi-output regression (shape (n_samples, n_outputs)), EVS is computed per output first.

Then multioutput decides how to combine them:

  • raw_values: return one score per output

  • uniform_average: mean of the per-output scores

  • variance_weighted: weighted mean using each output’s target variance (outputs with more variance count more)

This matters when different targets live on very different scales.

n = 300
y1 = 10 * rng.normal(size=n)  # high-variance target
y2 = 0.2 * rng.normal(size=n)  # low-variance target
y_true_mo = np.column_stack([y1, y2])

# Predictions: do well on y2, poorly on y1
y_pred_mo = np.column_stack(
    [
        y1 + rng.normal(0, 15, size=n),
        y2 + rng.normal(0, 0.05, size=n),
    ]
)

raw = explained_variance_score(y_true_mo, y_pred_mo, multioutput='raw_values')
uniform = explained_variance_score(y_true_mo, y_pred_mo, multioutput='uniform_average')
var_w = explained_variance_score(y_true_mo, y_pred_mo, multioutput='variance_weighted')

raw, uniform, var_w
(array([-1.1705,  0.931 ]), -0.11973865514890675, -1.169799844269795)
fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=['output 1 (high var)', 'output 2 (low var)'],
        y=raw,
        name='raw EVS',
    )
)
fig.update_layout(
    title=f'Aggregation: uniform={uniform:.3f}, variance_weighted={var_w:.3f}',
    yaxis_title='explained variance score',
)
fig

5) Sample weights (sample_weight)#

With per-sample weights \(w_i \ge 0\), sklearn uses weighted means and variances:

\[ \bar{z}_w = \frac{\sum_i w_i z_i}{\sum_i w_i}, \qquad \mathrm{Var}_w(z) = \frac{\sum_i w_i (z_i - \bar{z}_w)^2}{\sum_i w_i}. \]

So samples with larger weight influence both the numerator and denominator more. A common use case is heteroscedastic noise (some samples are much noisier / less reliable).

n = 240
x = rng.normal(size=n)
y_clean = 2.0 * x

# First third is much noisier
mask_noisy = np.arange(n) < (n // 3)
noise = rng.normal(0.0, 0.2, size=n)
noise[mask_noisy] = rng.normal(0.0, 1.5, size=mask_noisy.sum())

y_true_w = y_clean + noise
y_pred_w = y_clean  # model predicts the underlying signal

weights = np.ones(n)
weights[mask_noisy] = 0.2  # downweight noisy samples

score_unweighted = explained_variance_score(y_true_w, y_pred_w)
score_weighted = explained_variance_score(y_true_w, y_pred_w, sample_weight=weights)
score_unweighted, score_weighted
(0.8436035954505202, 0.9504664689590617)
order = np.argsort(x)

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=x,
        y=y_true_w,
        mode='markers',
        name='y_true',
        marker=dict(
            size=6,
            color=weights,
            colorscale='Viridis',
            showscale=True,
            colorbar=dict(title='weight'),
            opacity=0.85,
        ),
    )
)

fig.add_trace(
    go.Scatter(
        x=x[order],
        y=y_pred_w[order],
        mode='lines',
        name='y_pred (signal)',
        line=dict(color='black'),
    )
)

fig.update_layout(
    title=(
        'Downweighting noisy samples increases EVS: '
        f'unweighted={score_unweighted:.3f}, weighted={score_weighted:.3f}'
    ),
    xaxis_title='x',
    yaxis_title='y',
)
fig

6) Optimizing EVS directly (linear regression)#

explained_variance_score is primarily an evaluation metric. But it’s instructive to see what happens if you try to optimize it.

Maximizing EVS is equivalent to minimizing the residual variance:

\[ \max \mathrm{EVS} \quad \Longleftrightarrow \quad \min \mathrm{Var}(r), \; r = y - \hat{y}. \]

For a linear model \(\hat{y} = Xw + b\), define centered residuals \(r_c = r - \bar{r}\) and the loss \(L = \frac{1}{n}\sum_{i=1}^n r_{c,i}^2\).

The gradients are

\[ \nabla_w L = -\frac{2}{n} X^\top r_c, \qquad \frac{\partial L}{\partial b} = -\frac{2}{n}\sum_i r_{c,i} = 0. \]

So the intercept is not identifiable under this objective: any constant shift in \(\hat{y}\) leaves EVS unchanged. We’ll see that in gradient descent: \(b\) barely moves, EVS can look good, but \(R^2\) / MSE reveal the bias.

def fit_linear_gd(
    X,
    y,
    *,
    loss,
    lr=0.05,
    n_steps=300,
    w0=None,
    b0=0.0,
):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1)
    n_samples, n_features = X.shape

    w = np.zeros(n_features) if w0 is None else np.asarray(w0, dtype=float).copy()
    b = float(b0)

    history = {'step': [], 'mse': [], 'r2': [], 'evs': [], 'b': []}
    for step in range(n_steps):
        y_pred = X @ w + b
        r = y - y_pred

        if loss == 'mse':
            grad_w = -(2.0 / n_samples) * (X.T @ r)
            grad_b = -(2.0 / n_samples) * float(np.sum(r))
        elif loss == 'resid_var':
            r_c = r - r.mean()
            grad_w = -(2.0 / n_samples) * (X.T @ r_c)
            grad_b = -(2.0 / n_samples) * float(np.sum(r_c))  # always ~0
        else:
            raise ValueError("loss must be 'mse' or 'resid_var'")

        w -= lr * grad_w
        b -= lr * grad_b

        y_pred = X @ w + b
        history['step'].append(step)
        history['mse'].append(mean_squared_error_np(y, y_pred))
        history['r2'].append(r2_score_np(y, y_pred))
        history['evs'].append(explained_variance_score_np(y, y_pred))
        history['b'].append(b)

    return w, b, history


# Synthetic linear regression problem
n = 500
X = rng.normal(size=(n, 1))
w_true = np.array([2.5])
b_true = 1.7
y = b_true + X @ w_true + rng.normal(0.0, 0.5, size=n)
y = y.reshape(-1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

w_mse, b_mse, hist_mse = fit_linear_gd(X_train, y_train, loss='mse', lr=0.05, n_steps=300)
w_evs, b_evs, hist_evs = fit_linear_gd(X_train, y_train, loss='resid_var', lr=0.05, n_steps=300)


def evaluate(X_, y_, w, b):
    y_pred = X_ @ w + b
    return {
        'MSE': mean_squared_error_np(y_, y_pred),
        'R2': r2_score_np(y_, y_pred),
        'EVS': explained_variance_score_np(y_, y_pred),
        'residual_mean': float(np.mean(y_ - y_pred)),
    }


b_evs_post = float(y_train.mean() - (X_train.mean(axis=0) @ w_evs))

results = {
    'train (MSE loss)': evaluate(X_train, y_train, w_mse, b_mse),
    'test (MSE loss)': evaluate(X_test, y_test, w_mse, b_mse),
    'train (EVS loss)': evaluate(X_train, y_train, w_evs, b_evs),
    'test (EVS loss)': evaluate(X_test, y_test, w_evs, b_evs),
    'test (EVS loss + post-hoc intercept)': evaluate(X_test, y_test, w_evs, b_evs_post),
}

{
    'true': {'w': w_true, 'b': b_true},
    'trained_on_mse': {'w': w_mse, 'b': b_mse},
    'trained_on_evs': {'w': w_evs, 'b': b_evs},
    'evs_posthoc_intercept': b_evs_post,
    'metrics': results,
}
{'true': {'w': array([2.5]), 'b': 1.7},
 'trained_on_mse': {'w': array([2.4944]), 'b': 1.6951145770136242},
 'trained_on_evs': {'w': array([2.4944]), 'b': -7.917476198469688e-17},
 'evs_posthoc_intercept': 1.6951145770146998,
 'metrics': {'train (MSE loss)': {'MSE': 0.22740982130503978,
   'R2': 0.9646517961590391,
   'EVS': 0.9646517961590391,
   'residual_mean': 9.732659123073972e-13},
  'test (MSE loss)': {'MSE': 0.2563190118563668,
   'R2': 0.9652494187042723,
   'EVS': 0.9662444560416645,
   'residual_mean': 0.0856700581900787},
  'train (EVS loss)': {'MSE': 3.1008232505127644,
   'R2': 0.5180131987928035,
   'EVS': 0.9646517961590391,
   'residual_mean': 1.6951145770147},
  'test (EVS loss)': {'MSE': 3.4201735699636053,
   'R2': 0.5363082167501443,
   'EVS': 0.9662444560416775,
   'residual_mean': 1.780784635203714},
  'test (EVS loss + post-hoc intercept)': {'MSE': 0.2563190118560888,
   'R2': 0.9652494187043099,
   'EVS': 0.9662444560416775,
   'residual_mean': 0.08567005818901435}}}
fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=hist_mse['step'],
        y=hist_mse['evs'],
        mode='lines',
        name='EVS (trained on MSE)',
    )
)
fig.add_trace(
    go.Scatter(
        x=hist_evs['step'],
        y=hist_evs['evs'],
        mode='lines',
        name='EVS (trained on Var(resid))',
    )
)

fig.add_trace(
    go.Scatter(
        x=hist_mse['step'],
        y=hist_mse['r2'],
        mode='lines',
        name='R2 (trained on MSE)',
        line=dict(dash='dash'),
    )
)
fig.add_trace(
    go.Scatter(
        x=hist_evs['step'],
        y=hist_evs['r2'],
        mode='lines',
        name='R2 (trained on Var(resid))',
        line=dict(dash='dash'),
    )
)

fig.update_layout(
    title='Training curves: EVS can look good even with a biased intercept',
    xaxis_title='gradient descent step',
    yaxis_title='score',
)
fig
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist_mse['step'], y=hist_mse['b'], mode='lines', name='b (MSE loss)'))
fig.add_trace(go.Scatter(x=hist_evs['step'], y=hist_evs['b'], mode='lines', name='b (EVS loss)'))
fig.update_layout(
    title='Intercept during training (EVS-loss has ~zero gradient)',
    xaxis_title='step',
    yaxis_title='b',
)
fig
y_pred_test_mse = X_test @ w_mse + b_mse
y_pred_test_evs = X_test @ w_evs + b_evs
y_pred_test_evs_post = X_test @ w_evs + b_evs_post

low = float(
    np.min(
        [y_test.min(), y_pred_test_mse.min(), y_pred_test_evs.min(), y_pred_test_evs_post.min()]
    )
)
high = float(
    np.max(
        [y_test.max(), y_pred_test_mse.max(), y_pred_test_evs.max(), y_pred_test_evs_post.max()]
    )
)

fig = go.Figure()
fig.add_trace(go.Scatter(x=y_test, y=y_pred_test_mse, mode='markers', name='trained on MSE', opacity=0.7))
fig.add_trace(
    go.Scatter(x=y_test, y=y_pred_test_evs, mode='markers', name='trained on Var(resid)', opacity=0.7)
)
fig.add_trace(
    go.Scatter(
        x=y_test,
        y=y_pred_test_evs_post,
        mode='markers',
        name='Var(resid) + post-hoc intercept',
        opacity=0.7,
    )
)

fig.add_trace(
    go.Scatter(
        x=[low, high],
        y=[low, high],
        mode='lines',
        name='ideal',
        line=dict(color='black', dash='dot'),
    )
)

fig.update_layout(
    title='Predicted vs true (test): EVS-loss can be systematically shifted',
    xaxis_title='y_true',
    yaxis_title='y_pred',
)
fig

7) Practical usage (scikit-learn)#

In practice you almost never train by maximizing EVS; you train with a proper loss (e.g., MSE/MAE) and evaluate with EVS. Here’s the standard sklearn pattern:

lin = LinearRegression().fit(X_train, y_train)
y_pred = lin.predict(X_test)

{
    'coef': lin.coef_,
    'intercept': lin.intercept_,
    'EVS': explained_variance_score(y_test, y_pred),
    'R2': r2_score(y_test, y_pred),
    'MSE': mean_squared_error(y_test, y_pred),
}
{'coef': array([2.4944]),
 'intercept': 1.6951145770147051,
 'EVS': 0.9662444560416782,
 'R2': 0.9652494187043107,
 'MSE': 0.2563190118560831}

8) Pros, cons, and when to use it#

Pros#

  • Unitless and easy to compare across scales (like \(R^2\)).

  • Directly measures how much residual spread remains relative to the target spread.

  • Supports sample weighting and multi-output aggregation.

Cons / pitfalls#

  • Ignores additive bias: a constant offset does not change the score (can be very misleading).

  • Not a great training objective: intercept is not identifiable (zero gradient under the EVS-equivalent loss).

  • For nearly-constant targets, the metric can be unstable/ill-defined; force_finite=True can hide that.

Good use cases#

  • You care about matching fluctuations and will correct offsets separately (post-hoc calibration / de-meaning).

  • Targets are naturally centered/detrended, so additive bias is not meaningful.

  • Multi-output settings where you want to quantify variance capture per output and then aggregate.

9) Diagnostics: what to check alongside EVS#

Because EVS can be blind to bias, pair it with:

  • Residual mean (bias): \(\bar{r}\)

  • MAE / MSE / RMSE (accuracy)

  • \(R^2\) (captures both variance + bias effects)

  • plots: \(y\) vs \(\hat{y}\), residual histogram, residuals vs features/time
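These checks can be bundled into one small helper (the function name and dict keys here are ours, not a library API):

```python
import numpy as np
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)


def regression_report(y_true, y_pred):
    """Return EVS together with the bias/accuracy metrics it can hide."""
    r = np.asarray(y_true) - np.asarray(y_pred)
    return {
        'EVS': explained_variance_score(y_true, y_pred),
        'R2': r2_score(y_true, y_pred),
        'MAE': mean_absolute_error(y_true, y_pred),
        'RMSE': float(np.sqrt(mean_squared_error(y_true, y_pred))),
        'residual_mean': float(r.mean()),  # the additive bias EVS ignores
    }


rng_demo = np.random.default_rng(7)
y = rng_demo.normal(size=100)
report = regression_report(y, y + 0.5)  # constant offset: EVS = 1.0, but biased
print(report)
```

On this example EVS is exactly 1.0 while residual_mean is -0.5, which is precisely the failure mode the diagnostics above are meant to catch.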

10) Exercises#

  1. Construct two predictors with the same EVS but very different \(R^2\). Explain using the MSE decomposition.

  2. Implement multioutput weighting by hand for 3 outputs and verify against sklearn.

  3. In the gradient descent section, initialize \(b\) to a large value and show that the EVS-loss optimizer does not fix it.

  4. (Time series) Detrend a signal, fit a model, and compare EVS before/after detrending.

References#

  • scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html

  • scikit-learn regression metrics overview: https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics